The five leading causes of death in the United States from 1999 to 2014 are cancer, heart disease, unintentional injury, chronic lower respiratory disease, and stroke. The dataset includes the U.S. Department of Health and Human Services public health regions. Therefore, we can investigate the leading causes of death of each region, and suggest accordingly public health policy and remedies.
The dataset we are using has 191748 observations of 11 variables. The variables include year, cause of death, state, age range, hhs region, locality, number of observed deaths, population, number of expected deaths, percent excess deaths, and number of potential excess deaths. The hhs region indicates the 10 U.S. Departtment of Health and Human Servoces public health regions.
The number of observed deaths, population, number of expected deaths, percent of excess deaths, and number of excess deaths are continuous variables, and others are categorical variables. In this report, we are going to focus on percent excess deaths for locality, and how the five leading causes is distributed to the percent excess death. Percent excess deaths for each State are calculated as following:
percent excess deaths = \(\frac{\text{observed death} - \text{expected death}} {\text{observed death}}\)
The percent excess deaths ranges from 0 to 85.3% with a mean of 35.66%. The locality is divided by metropolitan and nonmetropolitan areas.
From the dataset, we obtain the information that the number of potentially excess deaths from the five leading causes in rural areas was higher than those in urban areas.
We then analyzed several factors that might influence the rural-urban difference in potentially excess deaths from the five leading causes, many of which are associated with sociodemographic and ecological differences between rural and urban areas.
Through statistical analysis, our report provides an interactive and straightforward view on the potentially excess deaths from the five leading causes of death in non-metropolitan and metropolitan areas.
The ultimate goal is to bring attention to preventing deaths in the rural areas through improving healthcare services and public health programs.
cod_data = read_csv("./data/NCHS_-_Potentially_Excess_Deaths_from_the_Five_Leading_Causes_of_Death.csv") %>%
clean_names() %>%
na.omit() %>%
filter(!(state == "United States")) %>%
separate(., percent_potentially_excess_deaths, into = c("percent_excess_death"), sep = "%") %>%
mutate(percent_excess_death = as.numeric(percent_excess_death), mortality = observed_deaths/population * 10000, mortality = as.numeric(mortality)) %>%
select(year, age_range, cause_of_death, state, locality, observed_deaths, population, expected_deaths, potentially_excess_deaths, percent_excess_death, mortality, hhs_region)
## Warning: Too many values at 191748 locations: 1, 2, 3, 4, 5, 6, 7, 8, 9,
## 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...
##columns removed
#"state_fips_code" "benchmark" "potentially_excess_deaths" "percent_excess_death" "mortality"
region_cod_data = cod_data %>%
select(state, locality, hhs_region, percent_excess_death) %>%
group_by(state,locality, hhs_region) %>%
summarise(mean_ped = mean(percent_excess_death)) %>%
dplyr::filter(!(state == "District of\nColumbia")) %>%
mutate(hhs_region = as.character(hhs_region))
After reviewing the dataset, we filter out entries with United States in state variable. Then we remove the percentage mark in the Percent Potentially Excess Deaths variable. We also create a variable called mortalityby calculating the quotient of observed death over population. Lastly we filter out entries with District of Columbia in region variable. After selecting the variables of interest, the data is ready for analysis. When we analyze the relationships between the outcomes(mortality, percent excess death, etc) and several covariates(year, cause of death, population, etc), we group the data by the covariates and calculate the mean outcome as response.
This interactive plot examines the differences of Standardized Mortality Ratio (SMR), the proportion between observed death and expected death, of the five causes of death in each public health region. Each observation indicates for which state it belongs to with the size of which is reflective of the corresponding population. Here we compare SMR within the same geographic area. The differences of the distribution of the SMR among five causes of deaths might associate with the sociodemographic, cultural, behavioral and structural factors of the specific region.
cod_data %>%
mutate(year = as.factor(year))
## # A tibble: 191,748 x 12
## year age_range cause_of_death state locality observed_deaths
## <fctr> <chr> <chr> <chr> <chr> <int>
## 1 2005 0-49 Cancer Alabama All 756
## 2 2005 0-49 Cancer Alabama Metropolitan 556
## 3 2005 0-49 Cancer Alabama Nonmetropolitan 200
## 4 2005 0-49 Cancer Alabama All 756
## 5 2005 0-49 Cancer Alabama Metropolitan 556
## 6 2005 0-49 Cancer Alabama Nonmetropolitan 200
## 7 2005 0-49 Cancer Alabama All 756
## 8 2005 0-49 Cancer Alabama Metropolitan 556
## 9 2005 0-49 Cancer Alabama Nonmetropolitan 200
## 10 2005 0-54 Cancer Alabama All 1346
## # ... with 191,738 more rows, and 6 more variables: population <int>,
## # expected_deaths <int>, potentially_excess_deaths <int>,
## # percent_excess_death <dbl>, mortality <dbl>, hhs_region <int>
p <- cod_data %>%
plot_ly(
x = ~expected_deaths,
y = ~observed_deaths,
size = ~population,
color = ~cause_of_death,
frame = ~hhs_region,
text = ~state,
hoverinfo = "text",
type = 'scatter',
mode = 'markers'
) %>%
layout(title = "Change of Standardized Mortality Ratio in National Public Health Regions", yaxis=list(title="Observed Death"),xaxis=list(title="Expected Death"))
p
<<<<<<< HEAD
=======